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bilistic clustering, their clustering results are sensitive to the presence of outliers in the data. Even a few 
outliers can compromise the ability of these algorithms to identify meaningful hidden structures rendering 
their outcome unreliable. This paper develops robust clustering algorithms that not only aim to cluster 
the data, but also to identify the outliers. The novel approaches rely on the infrequent presence of outliers 
in the data which translates to sparsity in a judiciously chosen domain. Capitalizing on the sparsity in 
the outlier domain, outlier-aware robust K-means and probabilistic clustering approaches are proposed. 

> 

psj ■ Their novelty lies on identifying outliers while effecting sparsity in the outlier domam through carefully 

ly^ ' chosen regularization. A block coordinate descent approach is developed to obtain iterative algorithms 

I with convergence guarantees and small excess computational complexity with respect to their non-robust 

, counterparts. Kernelized versions of the robust clustering algorithms are also developed to efficiently 

handle high-dimensional data, identify nonlinearly separable clusters, or even cluster objects that are not 
represented by vectors. Numerical tests on both synthetic and real datasets validate the performance and 
I applicability of the novel algorithms. 

- - -' Index Terms 

(Block) coordinate descent, clustering, expectation-maximization algorithm, Group-Lasso, kernel 
methods, K-means, mixture models, robustness, sparsity. 



Part of this work was presented at the 36th IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Prague, Czech 
Republic, May 2011. Work was in part supported by NSF grant CCF-1016605; Dr. Kekatos was funded by the European 
Community's Seventh Framework Programme (FP7/2008 under grant agreement No. 234914). The authors are with the ECE 
Dept., University of Minnesota, Minneapolis, MN 55455, USA, Emails: {foreiO02,kekatos,georgios}@umn.edu 



January 20, 2013 



DRAFT 



2 



I. Introduction 

Clustering aims to partition a set of data into subsets, called clusters, such that data assigned to the 
same cluster are similar in some sense. Working with unlabeled data and under minimal assumptions 
makes clustering a challenging, yet universal tool for revealing data structures in a gamut of applications 
such as DNA microarray analysis and bioinformatics, (social) network analysis, image processing, and 
data mining IIBTI . |[T4l . Moreover, clustering can serve as a pre-processing step for supervised learning 
algorithms in applications where labeling data one-at-a-time is costly. Multiple interpretations across 
disciplines of what a cluster is, have led to an abundant literature of application-specific algorithms lOTI . 

Among the algorithms which cluster data represented by vectors, K-means and Gaussian mixture 
model (GMM-)based clustering are two popular schemes ||22| . |[3T| . Conventional K-means relies on 
the Euclidean distance as a similarity measure, thereby yielding partitions that minimize the within- 
cluster scatter lfT4l . Contrastingly, soft (a.k.a. fuzzy) K-means is tailored to identify overlapping clusters 
by allowing each datum to belong to multiple clusters 01. GMM-based clustering considers observed 
data drawn from a probability density function (pdf) following a GMM, where each class-conditional 
pdf corresponds to a cluster lOTI . Clustering arises as a by-product of a maximum likelihood (ML) 
estimation framework for the GMM parameters. ML parameter estimates are typically obtained through 
the expectation-maximization (EM) algorithm Q- Kernel methods have been devised to enable clustering 
of nonlinearly separable clusters ll26l . ||25| . 

Notwithstanding their popularity, K-means and GMM-based clustering are sensitive to inconsistent 
data, termed outliers, due to their functional dependency on the Euclidean distance |[T6l . Outliers appear 
infrequently in the data, emerging either due to reading errors or because they belong to rarely-seen and 
hence, markedly informative phenomena. However, even a few outliers can render clustering unreliable: 
cluster centers and model parameter estimates can be severely biased, and thus the data-to-cluster 
assignment is deteriorated. This motivates robustifying clustering approaches against outliers at affordable 
computational complexity in order to unravel the underlying structure in the data. 

Robust clustering approaches to clustering have been investigated. In HI and |[T5l . an additional cluster 
intended for grouping outliers is introduced with is centroid assumed equidistant from all non-outlying 
data. Possibilistic clustering measures the so-called typicality of each datum with respect to (wrt) each 
cluster to decide whether a datum is an outlier ll20l . ll24l . However, possibilistic clustering is sensitive 
to initialization and can output the same cluster more than once. Clustering approaches originating from 
robust statistics, such as the minimum volume ellipsoid and Huber's e-contaminated model-based methods 
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HH, ll34l . extract one cluster at a time. This deflation approach can hinder the underlying data structure 
by removing elements before seeking other clusters. Other approaches rooted on robust statistics are 
based on the ^i-distance (K-medians), Tukey's biweighted function, and trimmed means ||3l, |[T9l . |[T2l . 
ll32l : but are all limited to linearly separable clusters. 

The first contribution of the present work is to introduce a data model for clustering that explicitly 
accounts for outliers via a deterministic outlier vector per datum (Section Jill. A datum is deemed an 
outlier if its corresponding outlier vector is nonzero. Translating the fact that outliers are rare to sparsity 
in the outlier vector domain leads to a neat connection between clustering and the compressive sampling 
(CS) paradigm ||5l. Building on this model, an outlier-aware clustering methodology is developed for 
clustering both from the deterministic (K-means), and the probabilistic (GMMs) perspectives. 

The second contribution of this work comprises various iterative clustering algorithms developed for 
robust hard K-means, soft K-means, and GMM-based clustering (Section Hill) . The algorithms are based 
on a block coordinate descent (BCD) iteration and yield closed-form updates for each set of optimization 
variables. In particular, estimating the outliers boils down to solving a group-Lasso problem Il33l . whose 
solution is computed in closed form. The novel robust clustering algorithms operate at an affordable 
computational complexity of the same order as the one for their non-robust counterparts. 

Several contemporary applications in bioinformatics, (social) network analysis, image processing, 
and machine learning call for outlier-aware clustering of high-dimensional data, or involve nonlinearly 
separable clusters. To accommodate these clustering needs, the novel algorithms are kernelized in Section 
HVt and this is the third contribution of our work. The assumed model not only enables such a kemelization 
for both K-means and the probabilistic setups, but it also yields iterative algorithms with closed-form 
updates. In Section |Vl the algorithms developed are tested using synthetic as well as real datasets from 
handwritten digit recognition systems and social networks. The results corroborate the effectiveness of 
the methods. Conclusions are drawn in Section |Vll 

Notation: Lower-(upper-)case boldface letters are reserved for column vectors (matrices), and calli- 
graphic letters for sets; (•)-^ denotes transposition; Nat the set of naturals {1, . . . , A^}; Op (Ip) the p x 1 
vector of all zeros (ones); Ip the p x p identity matrix; diag(xi, . . . , Xp) a. p x p diagonal matrix with 
diagonal entries xi, . . . , Xp, range(X) the range space of matrix X; E[-] denotes the expectation operator; 
AA(x; m, S) denotes the multivariate Gaussian pdf with mean m and covariance matrix S evaluated at 
x; ||x||a := Vx^Ax for a positive semidefinite matrix A; ||x||p := (XlILi kil^)^^^ for p > 1 stands for 
the ^r,-norm in R". 
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II. Sparsity-Aware Clustering: Context and Criteria 



After reviewing the clustering task, a model pertinent to outlier-contaminated data is introduced 
next. Building on this model, robust approaches are developed for K-means (Section III-AI) as well as 
probabilistic clustering (Section III-BI ). 

A. K-means Clustering 

Given a set of p-dimensional vectors X := {xi, . . . ,XAr}, let {Xi, . . . , Xc} be a partition of A' to C 
subsets (clusters) Xc C X for c € Nc, which are collectively exhaustive, mutually exclusive, and non- 
empty. Partitional clustering seeks a partition of X such that two vectors assigned to the same cluster are 
closer to each other in some well-defined sense, such as the Euclidean distance, than to vectors assigned 
to other clusters. 

Among partitional clustering methods, K-means is one of the most widely used with well-documented 
merits and a long history H. In the K-means setup, a centroid uic G is introduced per cluster Xc- Then, 
instead of comparing distances between pairs of points in X, the point-centroid distances ||x„ — mc||2 are 
considered. Moreover, for each input vector x„, K-means introduces the unknown memberships u„c for 
c G Nc, defined to be 1 when x„ S X^ and otherwise. To guarantee a valid partition, the membership 
coefficients apart from being binary (cl): n„c G {0,1}; they should also satisfy the constraints (c2): 
X]^=i Unc > 0, for all c, to preclude empty clusters; and (c3): Y^c=i ~ ^' ™' ^^^^ ^^'^^ 

vector is assigned to a cluster. 

The K-means clustering task can be then posed as that of finding the centroids {md^x ^^'^ cluster 
assignments Unc& by solving the optimization problem 



However, problem ^ is known to be NP-hard, even for C = 2 Q. Practically, a suboptimal solution 
is pursued using the celebrated K-means algorithm. This algorithm drops the (c2) constraint, which is 
checked in a post-processing step instead. Then, it alternately minimizes the cost in ([T]) wrt one of the 
set of variables {md and {n„c}, while keeping the other one fixed, and iterates. K-means iterations are 
guaranteed to converge to a stationary point of dl} ||28l . 

To better motivate and further understand the pros and cons of K-means clustering, it is instructive 
to postulate a pertinent data model. Such a model assumes that the input vectors can be expressed 
as x„ = Y^^^i nncHic + v„, where v„ is a zero-mean vector capturing the deviation of x„ from its 
associated centroid rric. It is easy to see that under (cl)-(c3), the minimizers of ([T|l offer merely a blind 
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least-squares (LS) fit of the data {x„}^^^ respecting the cluster assignment constraints. However, such a 
simplistic, yet widely applicable model, does not take into account outliers; that is points x,i violating the 
assumed model. This fact paired with the sensitivity of the LS cost to large residuals explain K-means' 
vulnerability to outliers |18|. 

To robustify K-means, consider the following data model which explicitly accounts for outliers 

c 

Xn = Uncin-c + On + Vn, n E Nat (2) 

c=l 

where the outlier vector o„ is defined to be deterministically nonzero if x„ corresponds to an outlier, and 
Op otherwise. The unknowns {uncT^cOn} in © can now be estimated using the LS approach as the 

2 

minimizers of X]^=i ^n— Ylc=i ''^ncT^c—On , which are the maximum likelihood (ML) estimates if 
M{0,a\). Even if Unc's were known, estimating {md and {o„} based solely on {x„} would be 
an under-determined problem. The key observation here is that most of the {o„} are zero. This motivates 
the following criterion for clustering and identification of at most s E Nat outliers 

N 

s. to J]l(||o„||2 > 0) < s (3) 



N 

min y 

M,0,UgWi ^ 
n=l 



C 



9 n=l 



c=l 

where M := [mi • • • uic], O := [oi • • • otv], U E M^^*^ denotes the membership matrix with entries 
[U]„^c := y-nc, is the set of all U matrices satisfying (cl) and (c3), and I(-) denotes the indicator 
function. Due to (cl) and (c3), each summand in the cost of Q can be rewritten as X]^i^™c||xn — 
iHc — o„|||; and the Lagrangian form of © as 

N C N 

Mo'^UeWi ^^Unc\\^n-rac-On\\l + \^l{\\Onh>^) (4) 

' ' n=l c=l n=l 

where A > is an outlier-controlling parameter. For A = 0, o„ should be set equal to the generally 
nonzero value x„ — nic for any c and yield a zero optimum cost, in which case all x„'s are declared as 
outliers. When A — oo, the optimum o„'s are zero, all the x„'s are deemed outlier free, and the problem 
in dU) reduces to the K-means cost in ([1]). This reduction of the NP-hard K-means problem to an instance 
of the problem in dD establishes the NP-hardness of the latter. 

Along the lines of K-means, similar iterations could be pursued for suboptimally solving dD. However, 
such iterations cannot provide any convergence guarantees due to the discontinuity of the indicator 
function at zero. Aiming at a practically feasible solver of (jj]), consider first that U E Z^i is given. The 
optimization wrt {M, O} remains non-convex due to X]^=i Idl^nlb > 0)- Following the successful CS 
paradigm, where the ^o-(pseudo)norm of a vector x E M^, defined as ||x||o := ^n=i^i\^n\ > 0), was 
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surrogated by its convex £i-norm ||x||i, the problem in (|4]| is replaced by 

N C N 




^ ^ ^n„c ||xn - nic - o„||2 + 

^ n=l c=l n=l 



o 



'nil 2 • 



(5) 



Our robust K-means approach is to minimize dSj, which is convex in {M, O}, but remains jointly 
non-convex. The algorithm for suboptimally solving the non-convex problem in (|5]l is postponed for 
Section IIII-AI Note that the minimization in ([5]) resembles the group Lasso criterion used for recovering 
a block-sparse vector in a linear regression setup []33l. This establishes an interesting link between robust 
clustering and CS. A couple of remarks are now in order 

Remark 1 (Mahalanobis distance). If the covariance matrix of v„ in ^ is known, say S, the Euclidean 
distance in I©-© can be replaced by the Mahalanobis distance ||x„— rric— o„|||,_i. 

Remark 2 (£i-penalty for entry-wise outliers). The regularization term J2n=i 11°" II 2 ® enables iden- 
tifying whole data vectors as outliers. Replacing it by X]^=i l|onlli enables recovery of outlying data 
entries instead of the whole vector Iterative solvers for this case can be developed using the methodology 
presented in Section |llll due to space limitations this case is not pursued here. 

Constraints (cl) and (c3) in ([T|) entail hard membership assignments, meaning that each vector is 
assigned to a single cluster However, soft clustering which allows each vector to partially belong to 
several clusters, can better identify overlapping clusters ID. One way to obtain fractional memberships is 
via soft K-means. Soft K-means differs from hard K-means by (i) relaxing the binary-alphabet constraint 
(cl) to the box constraint (c4): Unc G [0, 1]; and (ii) by raising the n„c's in ([Hi to the q-th power, where 

> 1 is a tuning parameter |[T]. The robust soft K-means scheme proposed here amounts to replacing x„ 
with its outlier-compensated version (x^ — o„), and leveraging the sparsity of the {o„}'s. These steps 
lead to the following criterion 



where U2 is the set of all U matrices satisfying (c3)-(c4). An algorithm for approximately solving Q 
is presented in Section IIII-AI Note that a hard partition of X can still be obtained from the soft Unc by 
assigning x„ to the c-th cluster, where c := argmaXcUnc- 

B. Probabilistic Clustering 

An alternative way to perform soft clustering is by following a probabilistic approach lISTI . To this 
end, a mixture distribution model is postulated for x„, while {unc}c=i ^low interpreted as unobserved 



TV C 



mm 
M,o,ue^/2 




(6) 



71=1 C=l 
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(latent) random variables. The centroids {md^^ are treated as deterministic parameters of the mixture 
distribution, and their ML estimates are subsequently obtained via the EM algorithm. 

To account for outliers, probabilistic clustering is generalized to model Q. Suppose that the {x„}'s in 
dlj) are i.i.d. drawn from a mixture model where the {o„}'s are deterministic parameters. The memberships 
Un := [uni---UncV latent random vectors, corresponding to the rows of U, and take values in 
{ei, . . . , ec}, where ec is the c-th column of Ic- If x„ is drawn from the c-th mixture component, 
then u„ = ec- Assume further that the class-conditional pdf's are Gaussian and modeled as p(x„|u.„ = 
Cc) = A/'(x„; mc+o„, XI) for all n and c. This implies that p(x„) = (x„; mc+o„, S) with 

TTc := Pr(u„ = ec). If the x„'s are independent, the log-likelihood of the input data is 



where X := [xi • • • x^r], and tt := [tti • • • vrc] . Controlling the number of outliers (number of zero o„ 
vectors) suggests minimizing the regularized negative log-likelihood as 

N 

mm-L(X;&) + X^\\On\\-s^i (8) 

n=l 

where := {tt S 7^,M,0,5] >- 0} is the set of all model parameters, V is the probability simplex 
V := {it : 7r^l=l and tt > 0}, and S ^ means that SI is a positive definite matrix. An EM- 
based algorithm for solving <^ is derived in Section IIII-B I Having estimated the parameters of the 
likelihood, the posterior probabilities := Pr(un = ec|x„) can be readily obtained and interpreted as 
soft memberships. 

Although modeling all class conditionals having a common covariance matrix Xl may seem restrictive, 
it guarantees that the GMM is well-posed, thereby avoiding spurious unbounded likelihood values ||2] 
p. 433]. Specifically, it is easy to see that even if all o„'s are set to zero, the log-likelihood of a GMM with 
different covariance matrices Sc per mixture grows unbounded, e.g., by setting one of the rric's equal to 
an x„ and letting Xlc — ^ for that particular c. This possibility for unboundedness is also present in dSj, 
and justifies the use of a common S. But even with a common covariance matrix, the vectors o„ can drive 
the log-likelihood in (|7]) to infinity. Consider for example, any (mc,o„) pair satisfying x„ = nic + o„ 
and let S — )• 0. To make the problem of maximizing L (X; 0) well-posed, the ||o„||5._i regularizer is 
introduced. Note also that for A — > oo, the optimal O is zero and ([8]l reduces to the conventional MLE 
estimation of a GMM; whereas for A — >• 0, the cost in ([8]) becomes unbounded from below. 
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III. Robust Clustering Algorithms 

Algorithms for solving the problems formulated in Section O are developed here. Section ITlI-AI focuses 
on the minimization of Q, while the minimization in ^ is obtained from ^ for q = 1. In Section 
IIII-B[ an algorithm for minimizing ([8]l is derived based on the EM approach. Finally, modified versions 
of the new algorithms with enhanced resilience to outliers are pursued in Section IIII-CI 



A. Robust ( Soft) K-Means Algorithms 

Consider first solving ^ for q > 1. Although the cost in Q is jointly nonconvex, it is convex 
wrt each of M, O, and U. To develop a suboptimum yet practical solver, the aforementioned per- 
variable convexity prompted us to devise a BCD algorithm, which minimizes the cost iteratively wrt 
each optimization variable while holding the other two variables fixed. Let M^*^ 0*^*^ and U^*) denote 
the tentative solutions found at the t-th iteration. Also, initialize U^*^) randomly in U2, and O^*^) to zero. 

In the first step of the t-th iteration, Q is optimized wrt M for U = U^*^^) and O = O^*^^). The 
optimization decouples over the rric's, and every rric*^ is the closed-form solution of an LS problem as 



N 



(n(*-i))''fx„,-o(*-i) 



m W = ^ ^ . (9) 



N 

(t-l)W 



n=l 



In the second step, the task is to minimize Q wrt O for U = U^* ^) and M = M^*). The optimization 

problem decouples per index n, so that each o„ can be found as the minimizer of 

c 

<AWK) := ^{ut'^y (||x„ - mW - oJl + Allo^lb) . (10) 

c=l 

The cost (o„) is convex but non-differentiable. However, its minimizer can be expressed in closed 
form as shown in the ensuing proposition. 



Proposition 1. The optimization problem in (IIOI ) is uniquely minimized by 

A 



1 



^||rn II2 



(11) 



where := max{x,0}, and r^*^ is defined as 

/ ,(t-i) 



rW ■= '■ ^ (12) 
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Proof: Since X]^i(^"c)'^ > ^'^'^ ^ (^^3)' "^^e first summand of (/)(*)(o„) in (ITOl ) 

is a strictly convex function of On- Hence, (o„) is a strictly convex function too and its minimizer 
is unique. Then, recall that a vector oi*-* is a minimizer of (ITOl ) if and only if G 00*^*-' (on-*), where 
9(/)(*)(o.„) is the sub-differential of (o„). For o„ ^ 0, where the cost in ([TOl l is differentiable, d<f>^^\on) 
is simply the gradient — 2^^]^(unc ^^)''^ — rric — ^1 + 211^7-^^ At o„ = 0, the sub-differential 
of the ^2-norm ||o„||2 is the set of vectors {v„ : ||v„||2 < 1} by definition, and then the sub-differential 
of 0W(o„) is a,/.W(o„) = {-2Ef=i(^^nc"'^)^ (x„ - m, - |v„) : ||v„||2 < l}. 
When the minimizer o^*^ is nonzero, the condition G d(j)^^\on^) implies 

^ ' ^oW=rW (13) 



^\\On II2, 



where r^i*'' has been defined in ([T2l ). Equation ([T3] ) reveals that o^*'' is a positively scaled version of r^*^ 
The scaling can be readily found by taking the ^2-norm on both sides of ([T3T l. i.e., \\on^ II2 = \\rn^ II2 — A/2, 
which is valid for ||rl*^||2 > A/2. Substituting this back to ( fT3l ). yields o^*^ = r^*^ ( 1 — ^ 



(t) 



2||r: 

For o^i*'' = 0, there exists a v^*^ for which ||v^i*^||2 < 1 and v^*^ = (2/A)r^*\ This is possible when 
||rn''||2 < A/2. These two cases for the minimizer of ([TOl i are compactly expressed via ([TT]) . ■ 

The update for On^ in (fTTI ) reveals two interesting points: (i) the cost (f)^^\on) indeed favors zero 
minimizers; and (ii) the number of outliers is controlled by A. After updating vector rn\ its norm is 
compared against the threshold A/2. If ||rn^||2 exceeds A/2, vector x„ is deemed an outlier, and it is 
compensated by a nonzero On\ Otherwise, o^*^ is set to zero and x„ is clustered as a regular point. 

During the last step of the t-th iteration, Q is minimized over U G ZY2 for M = M(*) and O = O^. 
Similar to the conventional soft K-means, the minimizer is available in closed form as |[T] 



/iix„-m<"-ol"||| + A||o!?| 



E 



2 



c'=l \ ll^n — m^' —On II2 + A||0„ ||2_ 



(14) 



Regarding the robust hard K-means, a similar BCD approach for solving ^ leads to updating M^*) 
and O^*) via and (fTT])-(fT2l) for q = 1. Updating U^*) boils down to the minimum-distance rule 



1 • II W Wi 

1 , c = argmm^, ||x„ - my - On \\2 



, otherwise 



Note that ([TS]) is the limit case of (O for g 1+. 

The robust K-means (RKM) algorithm is tabulated as Algorithm [T] RKM is terminated when ||M(*) — 
M^*^^) ||ir/||M(*^ \\p < €s, where || • \\f denotes the Frobenius norm of a matrix, and is a small positive 
threshold, e.g., = 10"^. The computational resources needed by RKM are summarized next. 
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Algorithm 1 Robust K-means 

Require: Input data matrix X, number of clusters C, q > I, and A > 0. 
1: Initialize O^*^^ to zero and U^*^) randomly in U2- 
2: for t = 1,2, . . . do 
3: Update M(*) via 
4: Update 0(*) via (HB-lUll). 
5: Update U(*) via ^ {q > 1) or ^ (q = 1). 
6: end for 



Remark 3 {Computational complexity of RKM). Suppose for concreteness that: (asl) the number of 
clusters is small, e.g., C < p; and (as2) the number of points is much larger than the input dimension, 
i.e., N ^ p. When (as2) does not hold, a modification of RKM is developed in Section |lVl Under (asl)- 
(as2), the conventional K-means algorithm performs 0{NCp) scalar operations per iteration, and requires 
storing 0{Np) scalar variables. For RKM, careful counting shows that the per iteration time-complexity 
is maintained at 0{NCp): ([141 ) requires computing the NC Euclidean distances ||x„ — rric*"^^ — o^*~^^ \\^ 
and the N norms ||o^f~^^||2 which is 0{NCp); m^*^'s are updated in 0{NCp); while ([IB-CIl) entail 
0{NCp) operations. Further, the memory requirements of RKM are of the same order as those for 
K-means. Note also that the additional N x p matrix O can be stored using sparse structures. 

The RKM iterations are convergent under mild conditions. This follows because the sequence of cost 
function values is non-increasing. Since the cost is bounded below, the function value sequences are 
guaranteed to converge. Convergence of the RKM iterates is characterized in the following proposition. 

Proposition 2. The RKM algorithm for q > I converges to a coordinate-wise minimum of Moreover, 
the hard RKM algorithm (q = 1) converges to a local minimum of ([5]l. 

Proof: By defining /s(c) as being zero when the Boolean argument c is true, and oo otherwise, the 
problem in © can be written in the unconstrained form 

N C 

^^%^^<c{\\^n-mc-On\\l + X\\On\\2) + fsi'^ eU2). (16) 

' ' n=l c=l 

The cost in ( fT6l ). call it /(M, 0,U), is a proper and lower semi-continuous function, which implies 
that its non-empty level sets are closed. Also, since / is coercive, its level sets are bounded. Hence, the 
non-empty level sets of / are compact. For q > 1, function /(M, 0,U) has a unique minimizer per 
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optimization block variable M, O, and U. Then, convergence of the RKM algorithm to a coordinate-wise 
minimum point of Q follows from ll29l Th. 4.1(c)]. 

When q = 1, define the first summand in ( fT6l ) as /o(M, O, U) := J2n=i Ylc=i ^nc||x„ — rric — o„|||, 
which is the differentiable part of /. Function /o has an open domain, and the remaining non-differentiable 
part of / is separable wrt the optimization blocks. Hence, again by ||29l Th. 4.1(c)], the RKM algorithm 
with q = I converges to a local minimum (M*,0*,U*) of i^. 

It has been shown so far that for q = I, a BCD iteration converges to a local minimum of Q. The 
BCD step for updating U is the hard rule in ([TST l. Hence, this BCD algorithm (i) yields a U* with binary 
entries, and (ii) essentially implements the BCD updates for solving ([5]). Since a local minimum of dill 
with binary assignments is also a local minimum of the claim of the proposition follows. ■ 



B. Robust Probabilistic Clustering Algorithm 

An EM approach is developed in this subsection to carry out the minimization in ([8]l. If U were known, 
the model parameters could be estimated by minimizing the regularized negative log-likelihood of the 
complete data (X,U); that is, 

TV 



mm 
© 

n=l 



N C 



L(X,U;0) + A J]||o„,||s-i (17) 

where 

L(X,U;0) ■.= Y^Y^Unc{\ogTic + \ogN {^n]m^ + On,J:)) . (18) 

71=1 C=l 

But since U is not observed, the cost in ([TTl l is suboptimally minimized by iterating the two steps of the 
EM method. Let 0^*^ denote the model parameter values at the t-th iteration. During the E-step of the t-th 
iteration, the expectation Q{@; 0(*~^)) := Eu|X;0(*-i) [L (X, U; 0)] is evaluated. Since L (X, U; 0) is 
a linear function of U, and u„c's are binary random variables, it follows that 

N C 

Q(0; 0(*-i)) = Y.Y1 ^nl (log + log (x„; m, + o„, S)) (19) 

n=l c=l 

where 7ni := Pr (u.„ = ec|x„; 0(*~^)). Using Bayes' rule, the posterior probabilities 7^*] are evaluated 
in closed form as 

li; _ - ■ V--,., c , , — , 



E?=iVrrV(x„;mr^)+or^),S(-i)^ 



During the M-step, 0*-*'' is updated as 

N 

0W=argmin - Q(0; 0(*-^)) + A J] ||o„||s-i- (21) 



^ n=l 
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A BCD strategy that updates each set of the parameters in © one at a time with all other ones fixed, is 
described next. First, the cost in (j2T) is minimized wrt tt. Given that Y^^=i Inl = 1 for all n, it is easy 
to check that the minimizer of — Z]^=i S^i Inl log tTc over V is found in closed form as 

1 ^ 

''c^ = J^Y.^nl for all cGNc. (22) 

n=l 

Subsequently, (|2T]) is minimized wrt M while tt, O, and S are set respectively to 7r(*\ 0(*~^\ and 
centroids are updated as the minimizers of a weighted LS cost yielding 

Z^n=l7nc l-X-n 0„ I 

= ^ — W) "^^^^^ ^^^^ 

Z]n=l 7nc 

Then, (|2T]) is minimized wrt O while keeping the rest of the model parameters fixed to their already 
updated values. This optimization decouples over n, and one has to solve 

c At) 

min^ -^||x„ - m(*) - OnW^^^a-D^-i + A||o„||(s(t-i))-i (24) 

c=l 

for all n G Nat. For a full covariance I], (l24l) can be solved as a second-order cone program. For the 
case of spherical clusters, i.e., S = a'^Ip, solving (l24l ) simplifies considerably. Specifically, the cost can 
then be written as Ylc'=i'ync\\^n — mi*^ — o„||| + 2Acj^*~^'' ||o„||2, which is similar to the cost in (ITOl ) 
for q = I, and for an appropriately scaled A. Building on the solution of (fTOl) . the o„'s are updated as 

Aa(*-i)' 



lr-W|l 
Tn 2 



(25) 



after redefining the residual vector as rl*^ := X]^i7"c(xn — mi*^) in lieu of (fT2l) . Interestingly, the 
thresholding rule of (|25l l shows that cj^*"^) affects the detection of outliers. In fact, in this probabilistic 
setting, the threshold for outlier identification is proportional to the value of the outlier-compensated 
standard deviation estimate and, hence, it is adapted to the empirical distribution of the data. 

The M-step is concluded by minimizing (|2T]) wrt S for tt = tt^*), M = M(*\ and O = O^*^ i.e., 

N C (t) TV 

^11-" - ^'^^ - + y logdetS + A lloWb-.. (26) 

n=l c=l n=l 



For a generic S, (1261 ) must be solved numerically, e.g., via gradient descent or interior point methods. 
Considering spherical clusters for simplicity, the first order optimality condition for (|26l ) requires solving 
a quadratic equation in cj^*^ . Ignoring the negative root of this equation, cj^*^ is found as 

N 



^ n=l 



N C / ^ N 



\ igg^"2l|x„ - m?' - o!?||l + llofflbj . (27) 
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Algorithm 2 Robust probabilistic clustering 

Require: Input data matrix X, number of clusters C, and parameter A > 0. 
1: Randomly initialize M(°), tt^o) G V, and set S(°) = 5lp (cj(°) = \/5) for (5 > 0, and 0(°) to zero. 
2: for t = 1,2, . . . do 
3: Update ^nl via for all n, c. 
4: Update 7r(*) via (l22l ). 
5: Update M(*) via (l23]l. 
6: Update O^*) by solving (El ((l25]l). 
7: Update (ct^*)) via dig ((l27])). 
8: end for 



The robust probabilistic clustering (RPC) scheme is tabulated as Algorithmic For spherical clusters, its 
complexity remains 0{NCp) operations per iteration, even though the constants involved are larger than 
those in the RKM algorithm. Similar to RKM, the RPC iterations are convergent under mild conditions. 
Convergence of the RPC iterates is established in the next proposition. 

Proposition 3. The RPC iterations converge to a coordinate-wise minimum of the log-likelihood in ([7]). 

Proof: Combining the two steps of the EM algorithm, namely (l20b and (|2TI ). it is easy to verify that 
the algorithm is equivalent to a sequence of BCD iterations for optimizing 

N C / Kf ( _L ^ 

- E E log ^^-^ "^^ + M +A E l|o»lli:-+^(r £ ^2)+/.(7r £ V)+f.[^ y 0) 

n=l c=l ^ '^^'^ ^ n=l 

(28) 

where 0' := {tt, M, O, S}, the iV x C matrix V has entries \^\n,c '■= Inc > 0, and as in (fT6l ) that /s(c) 
is zero when condition c is true, and oo otherwise. That the {7„c} are positive follows after using Bayes' 
rule to deduce that oc tTcM (x„; rric + o„, S) and noticing that (i) M (x,i; rric + o^i, XI) is positive 
for all x^i, and (ii) all tTc must be positive so that the cost in ( |28] ) remains finite. 

The objective function of this minimization problem is proper, bounded below, and lower semi- 
continuous implying, that its non-empty level sets are closed. Since this function is also coercive, its 
level sets are bounded. Hence, its non-empty level sets are compact. Moreover, the objective function 
has a unique minimizer for the optimization blocks tt, M, and O. In particular, the M block minimizer 
is unique since J2n=i 7"c > 0, for all c G Nc- Then, by |[29l Th. 4.1 (c)], the RPC algorithm converges 
to a coordinate- wise minimum point of d?]). ■ 
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Proposition [3] guarantees that the RPC iterations converge. However, since each non-differentiable term 
1 1 o„ 1 1 5^ -1 involves two different optimization variables S and o„, the BCD iteration can be trapped at 
a coordinate-wise local minimum, which is not necessarily a local minimum of ((S]). Once the iterations 
have converged, the final 7nc's can be interpreted as soft cluster assignments, whereby hard assignments 
can be obtained via the maximum a posteriori detection rule, i.e., x„ G A!c for c = argmaxc' 7„c'- 

Remark 4 (Selecting X). Tuning A is possible if additional information, e.g., on the percentage of outliers, 
is available. The robust clustering algorithm is ran for a decreasing sequence of A values {Xg}, using 
"warm starts" iTT], until the expected number of outliers is identified. When solving for A^, warm start 
refers to the optimization variables initialized to the solution obtained for Ag_i. Hence, running the 
algorithm over {A^} becomes very efficient, because few BCD iterations per Xg suffice for convergence. 

C. Weighted Robust Clustering Algorithms 

As already mentioned, the robust clustering methods presented so far approximate the discontinuous 
penalty I(||o„||2 > 0) by ||o„||2, mimicking the CS paradigm in which I(|x| > 0) is surrogated by the 
convex function However, it has been argued that non-convex functions such as log(|x| + e) for a 
small e > can offer tighter approximants of I(|x| > 0) ll30l . This rationale prompted us to replace 
||on||2 in ©, and by the penalty log(||o„||2 + e) to further enhance block sparsity in o„'s, and 
thereby improve resilience to outliers. 

Altering the regularization term modifies the BCD algorithms only when minimizing wrt O. This 

particular step remains decoupled across o„'s, but instead of the (^(*)(o„) defined in ( fTOl ). one minimizes 

c 

•/-l^^K) := j;(4*c"'^)'' (l|xn - - o„||i + Alog (||o„||2 + e)) (29) 

c=l 

that is no longer convex. The optimization in (|29l ) is performed using a single iteration of the majorization- 
minimization (MM) approact^ ITTI . The cost 0^^(o„) is majorized by a function /(*)(o„; o^* ^^), which 
means that (p^w\on) < /^'^On; oi*"^^) for every o„ and (l)w\on) = /(*)(o„; ol*""^^) when o„ = o'h'^^ 
Then /W(o„;oi*"'^ ) is minimized wrt o„ to obtain oi*^ 

To find a majorizer for 0^^(o„), the concavity of the logarithm is exploited, i.e., the fact that logx < 
logXo + x/xo — 1 for any positive x and Xq. Applying the last inequality for the penalty and ignoring 

'Note that the MM approach for minimizing (f)w\on) at the t-th BCD iteration involves several internal MM iterations. Due 
to the external BCD iterations and to speed up the algorithm, a single MM iteration is performed per BCD iteration t. 
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the constant terms involved, we end up minimizing 

/WK;ori)) := Y^i^t'^y - - ^nWl + A«||o„|b) (30) 

c=l 

where An"* := A/(||oi* ^^||2 + e). Comparing (|30] | to (ITOl ) shows that the new regularization results in a 
weighted version of the original one. The only difference between the robust algorithms presented earlier 
and their henceforth termed weighted counterparts is the definition of A. At iteration t, larger values for 
1 1 On ^^Ib lead to smaller thresholds in the thresholding rules (cf. ([TT]) . (|25])). thereby making o„ more 
likely to be selected as nonzero. The weighted robust clustering algorithms initialize On^ to the associated 
o„ value the non-weighted algorithm converged to. Thus, to run the weighted RKM for a specific value 
of A, the RKM needs to be run first. Then, weighted RKM is run with all the variables initialized to the 
values attained by RKM, but with the An ^ as defined earlier. 

The MM step combined with the BCD algorithms developed hitherto are convergent under mild 
assumptions. To see this, note that the sequences of objective values for the RKM and RPC algorithms 
are both non-increasing. Since the respective cost functions are bounded below, those sequences are 
guaranteed to converge. Characterizing the points and speed of convergence goes beyond the scope of 
this paper 

IV. Clustering High-Dimensional and Nonlinearly Separable Data 

The robust clustering algorithms of Section |IIl] are kemelized here. The advantage of kemelization is 
twofold: (i) yields computationally efficient algorithms when dealing with high-dimensional data, and (ii) 
robustly identifies nonlinearly separable clusters. 

A. Robust K-means for High-Dimensional Data 

The robust clustering algorithms derived so far involve generally 0{NCp) operations per iteration. 
However, several applications entail clustering relatively few but high-dimensional data in the presence 
of outliers. In imaging applications, one may wish to cluster N = 500 images of say p = 800 x 600 = 
480, 000 pixels; while in DNA microarray data analysis, some tens of (potentially erroneous or rarely 
occurring) DNA samples are to be clustered based on their expression levels over thousands of genes 
|[T4l . In such clustering scenarios where N, an efficient method should avoid storing and processing 
p-dimensional vectors I127il . To this end, the algorithms of Section |lll] are kemelized here ll25l . It will 
be shown that these kemelized algorithms require 0{N^C) operations per iteration and 0{N'^) space; 
hence, they are preferable when p > N"^. This kernelization not only offers processing savings in the 
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high-dimensional data regime, but also serves as the building module for identifying nonlinearly separable 
data clusters as pursued in the next subsection. 

We focus on kernelizing the robust soft K-means algorithm; the kemelized robust hard K-means can 
then be derived after simple modifications. Consider the N xC matrix Ug with entries [Ug]nc := Unc, and 
the Grammian K := X^X formed by all pairwise inner products between the input vectors. Even though 
the cost for computing K is 0{N'^p), it is computed only once. Note that the updates (ITTI ). and ([141 ) 
involve inner products between the p-dimensional vectors {o„, r„}^^]^, and {md^^. If {vj G R^}?^]^ is 
a pair of any of these vectors, the cost for computing v^V2 is clearly 0{p). But if at every BCD iteration 
the aforementioned vectors lie in range(X), i.e., if there exist {w^j G M^}?^]^ such that {vj = Xwj}^^^, 
then v^V2 = w^Kw2, and the inner product can be alternatively calculated in 0{N'^). 

Hinging on this observation, it is first shown that all the p x 1 vectors involved indeed lie in range (X). 
The proof is by induction: if at the {t — l)-st iteration every o^* G range(X) and U*-*"^-* G U2, it will 
be shown that oi*\ m^c \ updated by RKM he in range (X) as well. 

Suppose that at the t-th iteration, the matrix U(*~^) defining Ug* is in U2, while there exists matrix 
A(*~i) such that 0^*"^^ = XA^*"^). Then, the update of the centroids in Q can be expressed as 

mW = (X - 0(*-i))U(*-i) diag-i((U(*-i))^l^) = XBW (31) 

where 

bW := (I;v - A(*-i))U(*~^) diag-i((U(*-i))^l;v). (32) 

Before updating 0^*\ the residual vectors {r„} must be updated via ([T2l l. Concatenating the residuals 
in R(*) := [r[*^ . . . rf^], the update in (fT2l ) can be rewritten in matrix form as 

rW = X - M(*)(U(*-^))^diag-i(U(*-i)lc) = XAW (33) 

where 

aW := I;v - BW(u(*-i))^diag-nu(*-i)lc). (34) 
From ([TTI ). every o^*^ is a scaled version of r^*^ and the scaling depends on ||ri*'||2. Based on (1331 ). the 



latter can be readily computed as ||rl*''||2 = y (5n^)^K(5^i*-' = ||(5n^||K, where 5n^ stands for the n-th 
column of A^*). Upon applying the thresholding operator, one arrives at the update 

0(*) = XA(*) (35) 

where the n-th column of A^*) is given by 



a 



(*) = 



2 On K 



Vn. (36) 
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Having proved the inductive step by (1351 ). the argument is complete if and only if the outlier variables 
O are initialized as 0^°) = XA(°) for some A*^ including the practically interesting and meaningful 
initialization at zero. The result just proved can be summarized as follows. 

Proposition 4. By choosing 0^°) = XA(°) for any A(°) e M^^^ and U(°) G U2, the columns of the 
matrix variables O, M, and R updated by RKM all lie in range(X); i.e., there exist known A^*^ B^*^ 
and AW, such that O^*) = XA^, = XB^*), and R(*) = XA^*) for all t. 

What remains to be kernelized are the updates for the cluster assignments. For the update step (fT4l) 
or ([TST l. we need to compute ||x„ — mi*^ — oi*'*!!! llon^lb- Given that x„ = Xe„, where e„ denotes 
the n-th column of I^v, and based on the kernelized updates (OTI ) and (1351 ). it is easy to verify that 

||x. - mW - oW||i = ||X(e„ - /jW - = ||e„ - (3^ - (37) 

for every n and c, where /3i*'' is the c-th column of B^*). As in ( [35l ). it follows that 

||oW||2 = ||XaW||2 = ||aW||j^. (38) 

The kernelized robust K-means (KRKM) algorithm is summarized as Algorithm |3] As for RKM, the 
KRKM algorithm is terminated when ||M(*) -M(*~i)||f/||M(*) \\f < for a small > 0. Based on ^ 
and exploiting standard linear algebra properties, the stopping condition can be equivalently expressed 

as (Eti llM*^ - llic)/(E?=i llM*^ Ilic) < 

Notice that this kernelized algorithm does not explicitly update M, R, or O; actually, these variables 

are never processed. Instead, it updates A, B, and A; while the clustering assignments are updated via 

([T4l ). (l37l ). and (l38l ). Ignoring the cost for finding K, the computations required by this algorithm are 

0{N^C) per iteration, whereas the stored variables A, B, A, U^, and K occupy 0{N'^) space. Note 

that if the centroids M are explicitly needed (e.g., for interpretative purposes or for clustering new input 

data ), they can be acquired via (OTI l after KRKM has terminated. 

B. Kernelized RKM for Nonlinearly Separable Clusters 

One of the limitations of conventional K-means is that clusters should be of spherical or, more generally, 
ellipsoidal shape. By postulating the squared Euchdean distance as the similarity metric between vectors, 
the underlying clusters are tacitly assumed to be linearly separable. GMM-based clustering shares the 
same limitation. Kernel K-means has been proposed to overcome this obstacle |[26l by mapping vectors 
x„ to a higher dimensional space H through the nonlinear function ip : W ^ T-L. The mapped data 
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Algorithm 3 Kernelized RKM 

Require: Grammian matrix K ;^ 0, number of clusters C, q > I, and A > 0. 
1: Initialize U^'^^ randomly in U2, and A^*^) to zero. 
2: for t = 1,2, . . . do 
3: Update B(*) from 
4: Update A(*) from (|34] |. 
5: Update A(*) from dMj. 

6: Update U(*) and U^*^ from ([Hi or dB), and 
7: end for 



{ip{^n)}n=i the so-termed feature space which is of dimension P > p or even infinite. The 

conventional K-means algorithm is subsequently applied in its kernelized version on the transformed 
data. Thus, linearly separable partitions in feature space yield nonlinearly separable partitions in the 
original space. 

For an algorithm to be kemelizable, that is to be able to operate with inputs in feature space, the inner 
product between any two mapped vectors, i.e., (^^(x„)(/9(xm), should be known. For the non-kemelized 
versions of K-means, RKM, and RFC algorithms, where the linear mapping (/5(x„) = x„ can be trivially 
assumed, these inner products are directly computable and stored at the (n, m)-th entry of the Grammian 
matrix K = X^X. When a nonlinear mapping is used, the so-termed kernel matrix K with entries 
[K]„ „i := (/5^(x„)(/3(xm) replaces the Grammian matrix and must be known. By definition, K must be 
positive semidefinite and can be employed for (robust) clustering, even when (^(x„) is high-dimensional 
(cf. Section HV-AI) . infinite-dimensional, or even unknown lITOl . 

Of particular interest is the case where 'H is a reproducing kernel Hilbert space. Then, the inner product 
in T-L is provided by a known kernel function K(x„,Xm) := 99-^(x„)(^(xm) l[25l Ch. 3]. Typical kernels 
for vectorial input data are the polynomial, Gaussian, and the sigmoid ones; however, kernels can be 
defined for non-vectorial objects as well, such as strings or graphs ll25l . 

After having RKM tailored to the high-dimensional input data regime in Section |IV-A[ handling 
arbitrary kernel functions is now straightforward. Knowing the input data X and the kernel function 
k(x„, Xm), the kernel matrix can be readily computed as [K]„^m = ^(xn, x^) for n, m £ Nat. By using 
the kernel in lieu of the Grammian matrix. Algorithm |3] carries over readily to the nonlinear clustering 
regime. 
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As shown earlier, when clustering high-dimensional data, the centroids can be computed after the 
robust clustering algorithm has terminated via (l3Ti . This is not generally the case withrobust nonlinear 
clustering. For infinite-dimensional or even finite dimensional feature spaces, such as the one induced 
by a polynomial kernel, even if one is able to recover the centroid rric € T-L, its pre-image in the input 
space may not exist 1251 Ch. 18]. 

C. Kernelized Robust Probabilistic Clustering 

Similar to RKM, the RPC algorithm can be kernelized to (i) facilitate computationally efficient 
clustering of high-dimensional input data, and (ii) enable nonlinearly separable clustering. To simplify 
the presentation, the focus here is on the case of spherical clusters. 

Kemelizing RPC hinders a major difference over the kemelization of RKM: the GMM and the RPC 
updates in Section ITlI-B I remain valid for {(^(x^) ^T-L} only when the feature space dimension P remains 
finite and known. The implication is elucidated as follows. First, updating the variance in (|27| | entails the 
underlying vector dimension p - which becomes P when it comes to kemelization. Second, the (outlier- 
aware) mixtures of Gaussians degenerate when it comes to modeling infinite-dimensional random vectors. 
To overcome this limitation, the notion of the empirical kernel map will be exploited |[25l Ch. 2.2.6]. 
Given the fixed set of vectors in X, instead of ip, it is possible to consider the empirical kernel map 
ip : W ^ defined as (^(x) := (K-'^/^)"^[k(xi, x) • • • ^(xAr, x)]-^, where K is the kernel matrix of 
the input data X, and the Moore-Penrose pseudoinverse. The feature space % induced by (p has 
finite dimensionality N. It can be also verified that (^^(x„)(^(xm) = (/7"'^(x„)93(xm) = K(x„,Xm) for all 
x„, Xm G X; hence, inner products in T-L are readily computable through k. 

In the kernelized probabilistic setup, vectors {(p{ii.n)}n=i assumed drawn from a mixture of C 
multivariate Gaussian distributions with common covariance S = ct^Iat for all clusters. The EM-based 
updates of RPC in Section IIII-BI remain valid after replacing the dimension p in (|27] ) by A^, and the 
input vectors {x„} by {(^(x„)} with the critical property that one only needs to know the inner products 
(^^(x„)(^(xm) stored as the {n,m)-t\\ entries of the kernel matrix K. The kemelization procedure is 
similar to the one followed for RKM: first, the auxiliary matrices A'^*^ B'^*^ and A^*) are introduced. 
By randomly initializing a^^\ t^^^^ € V, B^*^) G M^^*^, and setting A^'^^ to zero, it can be shown as 
in Proposition m that the kemelized RPC updates for 0(*\ M^*), and R^*) have their columns lying in 
range($), where $ := [(^(xi) • • • (^(x^r)]. Instead of the assignment matrix U in KRKM, the N x C 
matrix of posterior probability estimates F^*) is used, where [r(*)]„^c := Tni satisfying T^^^lc = In Vt. 



January 20, 2013 



DRAFT 



20 



Algorithm 4 Kernelized RPC 



Require: Grammian or kernel matrix K ^ 0, number of clusters C, and A > 0. 

1: Randomly initialize cj(°\ 7r(°) G V, and B(°); and set A(°) to zero. 

2: for t = 1,2, . . . do 

3: Update r(*) via dlOll exploiting ||x„ - m^*"^^ - oi*"^^ ||2 = ||e„ - - ai*"^^ W]^ for all n, c. 

4: Update ttW as tt^ = {T^^'^Y'In/N. 

5: Update BW as B(*) = (Itv - A(*-i))rW diag-i(iV7rW). 

6: Update AW as =In - B(*)(r(*))^. 



7: Update the columns of A^*) as oon' = di*^ 



for all n. 



II^S,"I|: 

Update 0"'^*^ via ( [27] ) where p is replaced by N, using the ^2-norms computed in Step 3, and 
exploiting ||on^||2 = ||Q;n^||K for all n. 



9: end for 



The kernelized RPC (KRPC) algorithm is summarized as Alg. HI As with KRKM, its computations 
are 0{N^C) per iteration, whereas the stored variables A, B, A, F, tt, K, and a occupy 0{N'^) space. 

Remark 5 (Reweighted kernelized algorithms). Reweighted kernelized algorithms similar to the ones in 
Section IIII-CI can be derived for KRKM and KRPC by a simple modification. In both cases, it suffices 
to introduce an iteration-dependent parameter = A/(||on ^''ib + e) per index n. Note that ||o„||2's 
can be readily computed in terms of kernels as shown earlier. 

V. Numerical Tests 

Numerical tests illustrating the performance of the novel robust clustering algorithms on both synthetic 
and real datasets are shown in this section. Performance is assessed through their ability to identify outliers 
and the quality of clustering itself. The latter is measured using the adjusted rand index (ARI) between 
the partitioning found and the true partitioning of the data whenever the latter is available Wf\ . In each 
experiment, the parameter A is tuned using the grid search outlined in Remark |4] over at most 1,000 
values. Thanks to the warm-start technique, the solution path for all grid points was computed in an 
amount of time comparable to the one used for solving for a specific value of A. 

A. Synthetic Datasets 



Two synthetic datasets are used. The first one, shown in Fig. 1(a) consists of a random draw of 200 



vectors from C = 4 bivariate Gaussian distributions (50 vectors per distribution), and 80 outlying vectors 
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(a) Dataset with C = 4 spherical clusters. 
Fig. 1. Synthetic datasets: In-(out-)lier vectors are denoted 
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(b) Dataset with C = 2 concentric rings, 
by circles • (stars -k). 



{N = 280). The Gaussian distributions have different means and a common covariance matrix O.8I2. 
The second dataset comprises points belonging to C = 2 concentric rings as depicted in Fig. |l(b)| The 
inner (outer) ring has 50 (150) points. It also contains 60 vectors lying in the areas between the rings and 
outside the outer ring corresponding to outliers {N = 260). Clustering this second dataset is challenging 
even if outliers were not present due to the shape and multiscale nature of the clusters. 

Starting from the dataset with the four spherical clusters, the effect of A on the number of outliers 
identified is investigated. In Fig. |2l the values of {||o„||2}^=x plotted as a function of A. The outlier- 
norm curves shown in Fig. |2(a) [ correspond to the RKM algorithm with q = 1 using a random initialization. 
For A > 17, all o„'s are set to zero, while as A approaches zero, more o„'s take nonzero values. Selecting 
A G [6.2, 7.6] yields 80 outliers. Fig. [2(b)] shows {Ho^HJ^^^ as A varies for the RPC algorithm assuming 
Xl = cr^Ip. The curves for some o„'s exhibit a fast transition from zero. 

In Fig. [3l the number of outliers identified, i.e., the number of input vectors x„ with corresponding 
||on||2 > 0, is plotted as a function of A. Proper choice of A enables both RKM and RPC to identify exactly 
the 80 points, which were generated as outliers in the "ground truth" model. For the RKM algorithm with 
q = 1, there is a plateau for values of A G [6.2, 7.6]. This plateau defines a transition region between the 
values of A that identify true outliers and values of A which erroneously deem non-outliers as outliers. 
Although the plateau is not present for q > 1, the curves show an increase in their slope for A < 5 
indicating that non-outliers are erroneously labeled as outliers. RPC with A = 0.91 correctly identifies 
the 80 outlying vectors. Notice that the range of A values for which outliers are correctly identified is 
smaller than the one for RKM due to the scaling of A by a. 
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Table U shows the root-mean-squared error (RMSE) of cluster center estimates averaged over 100 
algorithm initializations. Two levels of outlier contamination are considered: 40 out of = 240 points 
(approximately 17%), and 80 out of = 280 (approximately 29%). Apart from the novel algorithms, 
hard K-means, soft K-means with q = 1.5, and EM are also included. The robust versions of the hard 
K-means, soft K-means, and EM algorithms achieve lower RMSE with extra improvement effected by 
the weighted RKM (WRKM) and the weighted RPC (WRPC) algorithms. 

Next, we consider clustering the second nonlinearly separable dataset using the Gaussian kernel 
K(x„,Xm) = exp(— Qk||x„ — Xmllj), whcrc Ok > is a scaling parameter. The parameter is 
chosen as a robust variance estimate of the entire dataset as described in (61. Both KRKM and KRPC 
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TABLE I 

RMSE OF CLUSTER CENTER ESTIMATES. 





RMSE 


Outliers/TV 


40/240 


80/280 


hard K-means 


0.7227 


1.0892 


soft K-means 


0.5530 


1.0206 


EM 


0.6143 


1.0032 


hard RKM 


0.4986 


0.8813 


hard WRKM 


0.4985 


0.8812 


soft RKM 


0.2587 


0.6259 


soft WRKM 


0.0937 


0.1758 


RFC 


0.2789 


0.3891 


WRPC 


0.1525 


0.1750 




(a) KRKM algorithm. (b) KRPC algorithm. 

Fig. 4. Number of outliers identified as a function of A for the dataset in Fig. |l(b)| 



are able to identify the 60 outlying points. In Fig. |4l the number of outliers identified by KRKM and 
KRPC is plotted as a function of A for different values of a^- Fig. [5] illustrates the values of ||o„ ||2's for 
WKRKM and WKRPC when seeking 60 outliers. Points surrounded by a circle correspond to vectors 
identified as outliers, and each circle's radius is proportional to its corresponding ||o„||2 value. 



January 20, 2013 



DRAFT 



24 




(a) KRKM algorithm {q = 1.1). (b) KRPC algorithm. 

Fig. 5. Clustering results for the dataset in Fig. 1 1(b) [ using a Gaussian kernel with = 0.2. Points suiTounded by a circle 
were deemed as outliers; the radius of the circle is proportional to the value of ||o„||2. Smallest and largest ||o„||2 values are 
shown. 

B. USPS Dataset 

In this subsection, the robust clustering algorithms are tested on the United States Postal Service 
(USPS) handwritten digit recognition corpus. This corpus contains gray-scale digit images of 16 x 16 
pixels with intensities normalized to [—1, 1]. It is divided to 7,201 training and 2,007 test examples of 
the digits 0-9. Although the corpus contains class labels, they are known to be inconsistent: some digits 
are erroneously labeled, while some images are difficult to be classified even by humans |[25l App. A]. In 
this experiment, the subset of digits 0-5 is used. For each digit, both training and test sets are combined 
to a single set and then 300 images are sampled uniformly at random, yielding a dataset of 1800 images. 
Each image is represented as a 256-dimensional vector normalized to have unit ^2-norm. 

Hard RKM {q = 1) and RPC algorithms are used to partition the dataset into C = 6 clusters and 
identify s = 100 outliers. All algorithms were tested for 20 Monte Carlo runs with random initializations 
common to all algorithms. The final partitioning is chosen as the one attaining the smallest cost in ([5]). 
The quality of the clustering is assessed through the ARI after excluding the outlier vectors. The ARI 
values for K-means, K-medians, and the proposed schemes are shown in Table |lll Note that the ARI 
values for RKM (RPC) and WRKM (WRPC) are equal. This indicates that the weighted algorithms do 
not modify the point-to-cluster assignments already found. Interestingly, the K-medians algorithm was 
not able to find a partitioning of the data revealing the 6 digits present, even after 100 Monte Carlo runs. 

The USPS dataset was clustered using the RKM and WRKM tuned to identify 100 outUers. WRKM 
is initialized with the results obtained by RKM. Although RKM and WRKM yielded the same outlier 
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TABLE 11 

ARI COEFFICIENTS FOR THE USPS DATASET (C = 6) 



Kernel 


K-means 


K-medians 


RKM 


RFC 


Linear 


0.6469 


0.5382 


0.6573 


0.6508 


Polynomial 


0.5571 




0.6978 


0.6965 



images, the size of the o„'s was different, becoming nearly uniform for WRKM. The USPS dataset was 



also clustered using the RPC and the WRPC algorithms. Fig. |6(a)| shows the cluster centroids obtained 
by RPC and WRPC. Fig. |6(b)| shows the 100 oudiers identified. The oudiers identified by the RPC and 
WRPC algorithms also coincide. The position of the outlier images in the mosaic corresponds to their 
ranking according to the size of their corresponding o„ (largest to smallest from left to right, top to 
bottom). Note that all outliers identified have a trait that differentiates them from the average image in 
each cluster. Among the 100 outliers detected by RKM and RPC, 97 are common to both approaches. 

A kemelized version of the algorithms was also used on the USPS dataset. Similar to [251, the 
homogeneous polynomial kernel of order 3, that is K(x.„,Xm) = (x^x^)^, was used. The ARI scores 
obtained by the kemelized robust clustering algorithms are shown in Table JIl Based on these scores, two 
important observations are in order: (i) kemelized K-means is more sensitive to outhers than K-means is; 
but (ii) KRKM for the particular kemel yields an improved clustering performance over RKM. Finally, 
the 100 outliers identified by KRKM are shown in Fig. |6(c)[ 

C. Dolphin 's Social Network 

Next, KRKM is used to partition and identify outhers in a social network of = 62 bottlenose 
dolphins living in Doubtful Sound, New Zealand |[23l . Links between pairs of nodes (dolphins) represent 
statistically significant frequency association. The network structure is summarized by the N x N 
adjacency matrix E. To identify social groups and outliers, the connection between kemel K-means and 
spectral clustering for graph partitioning is exploited ||27l . According to this connection, the conventional 
spectral clustering algorithm is substituted by the kemelized K-means algorithm with a specific kemel 
matrix. The kemel matrix used is K = i/I^ + D~^/^ED~^/^, where D := diag(El7v) and z> is chosen 
larger than the minimum eigenvalue of D~^/^ED~^/^ such that K ^ 0. 

Kemel K-means for graph partitioning is prone to being trapped at poor local minima, depending on 
initialization fTOl . The KRKM algorithm with C = 4 clusters is initialized by the spectral clustering 
solution using the symmetric Laplacian matrix L := I^r — D^^/^ED~^/^. The parameter A is tuned to 
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identify s = 12 outliers. Fig. [T] depicts the network partitioning and the outliers identified. The results 
show that several nodes identified as outliers have unit degree. Having a single link indicates that nodes 
are marginally connected to their corresponding clusters thus deemed as outliers. 

Other outlier instances adhere to more complicated structures within the network. Node zig has a single 
link, yet it is not identified as an outlier possibly due to the reduced size of its cluster, especially since four 
other nodes in the same cluster are identified as outliers. Interestingly, nodes sn89, snlOO, tr99 and kringel, 
with node degrees 2, 7, 7, and 9, respectively, are also identified as outliers. Using gender information 
about the dolphins |[23l . we observe that sn89 is a female dolphin assigned to a cluster dominated by 
males (10 males, 3 females, and 2 unobserved). Likewise, the connectivity of snlOO and tr99 in the graph 
shows that they share many edges with female dolphins in other clusters which differentiates them from 
other female dolphins within the same cluster. Finally, kringel is a dolphin connected to 6 dolphins in 
other clusters and only 3 dolphins in its own cluster. 

D. College Football Network 

KRKM is used to partition and identify outliers in a network of A'^ = 115 college football teams. The 
college football network represents the schedule of Division I games for the season in year 2,000 ||T3l . 
Each node corresponds to a team and a link between two teams exists if they played against each other 
during the season. The teams are divided into C = 12 conferences and each team plays games with teams 
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Cross 




Fig. 7. KRKM clustering of the dolphin's social network: outliers are depicted bold-faced; male, female, and unobserved 
gender are represented by squares, circles, and triangles, respectively. 
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Fig. 8. The kernel matrix for the college football network permuted using KRKM clustering. Zero entries are colored blue 
and outliers are colored red. 

in the same conference more often. KRKM is initialized via spectral clustering as described in Section 
IV-CI while A is tuned to identify s = 12 outliers. Fig. [8] shows the entries of the kernel matrix K after 
being row and column permuted so that teams in the same cluster obtained by KRKM are consecutive. 
The ARI coefficient yielded by KRKM after removing the outliers was 0.9218. 

Teams identified as outliers sorted in descending order based on their ||o„||2 values are: Connecticut, 




20 40 60 80 100 

Teams 
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Navy, Notre Dame, Northern Illinois, Toledo, Miami (Ohio), Bowling Green State, Central Michigan, 
Eastern Michigan, Kent, Ohio, and Marshall. Three of them, namely Connecticut, Notre Dame, and Navy, 
are independent teams. Connecticut is assigned to the Mid-American conference, but it does not play 
as many games with teams from this conference (4 games) as other teams in the same conference do 
(around 8 games). Notre Dame and Navy play an equal number of games with teams from two different 
conferences so they could be assigned to either one. Several teams from the Mid-American conference 
are categorized as outliers. In hindsight, this can be explained by the subdivision of the conference into 
East and West Mid-American conferences. Teams in each of the Mid-American sub-conferences played 
about the same number of games with teams from their own sub-conference and the rest of the teams. 
Interestingly, using KRKM with C = 13 while still seeking for 12 outliers, the sub-partition of the 
Mid-American conference is identified. In this case, the ARI coefficient for the partition after removing 
outliers is 0.9110. The three independent teams, Connecticut, Notre Dame, and Navy, are again among 
the 12 outliers identified. 
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VI. Conclusions 

Robust algorithms for clustering based on a principled data model accounting for outliers were devel- 
oped. Both deterministic and probabilistic partitional clustering setups based on the K-means algorithm 
and GMM's, respectively, were considered. Exploiting the fact that outliers appear infrequently in the data, 
a neat connection with sparsity-aware signal processing algorithms was made. This led to the development 
of computationally efficient and provably convergent robust clustering algorithms. Kemelized versions of 
the algorithms, well-suited for high-dimensional data or when only similarity information among objects 
is available, were also developed. The performance of the robust clustering algorithms was validated via 
numerical experiments both on synthetic and real datasets. 
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